Factor Analysis applied in Multivariate Regression: Machine Learning’s Perspective

Biostatistics Machine Learning PCA EFA Regression

FA vs. PCA;
Conducting an Exploratory Factor Analysis Example;
Factor Score;
FA applied in Regression;

Hai Nguyen
September 30, 2022

Questions

1- What is difference between EFA and PCA?
2- Is it possible that we can run the regression analysis on factors generating from Factor Analysis?
3- If so, how can we run it?

The demonstration is to use the dataset Factor-Hair-Revised.csv to build a regression model to predict satisfaction.

EFA vs. PCA

FA PCA
Meaning A factor (or latent) is a common or underlying element with which several other variables are correlated
\(X_1 = L_1\times F + \epsilon_1\)
\(X_2 = L_2\times F + \epsilon_2\)
\(X_3 = L_3\times F + \epsilon_3\) where, F is the factor, W is are the loading
A component is a derived new dimension (or variable) so that the derived variables are linearly independent of each other
\(Y = W_1\times PC_1 + W_2\times PC_2 + ...+ W_{10}\times PC_{10} + C\) where, PCs are the components & W is are the weights
Process The original variables are defined as the linear combinations of the factors The components are calculated as the linear combinations of the original variables
Objective FA focuses on explaining the covariances or the correlations between the variables The aim of PCA is to explain as much of the cumulative variance in the predictors (or variables) as possible
Purpose Factor Analysis is used to understand the underlying ‘cause’ which these factors (latent or constituents) capture much of the information of a set of variables in the dataset data PCA is used to decompose the data into a smaller number of components and therefore is a type of Singular Value Decomposition (SVD)
Interpretation of coefficients The coefficients (loadings) in the factor analysis express the relationship or association of each variable (X) to the underlying factor (F) The coefficients indicate which component contributes more to the target variable, Y, as the independent variables are standardized

An Example: Data Exploration

prodsurvey <- read_csv("Factor-Hair-Revised.csv")
head(prodsurvey)
# A tibble: 6 × 13
     ID ProdQual  Ecom TechSup CompRes Adver…¹ ProdL…² Sales…³ ComPr…⁴
  <dbl>    <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1     1      8.5   3.9     2.5     5.9     4.8     4.9     6       6.8
2     2      8.2   2.7     5.1     7.2     3.4     7.9     3.1     5.3
3     3      9.2   3.4     5.6     5.6     5.4     7.4     5.8     4.5
4     4      6.4   3.3     7       3.7     4.7     4.7     4.5     8.8
5     5      9     3.4     5.2     4.6     2.2     6       4.5     6.8
6     6      6.5   2.8     3.1     4.1     4       4.3     3.7     8.5
# … with 4 more variables: WartyClaim <dbl>, OrdBilling <dbl>,
#   DelSpeed <dbl>, Satisfaction <dbl>, and abbreviated variable
#   names ¹​Advertising, ²​ProdLine, ³​SalesFImage, ⁴​ComPricing
dim(prodsurvey)
[1] 100  13
str(prodsurvey)
spc_tbl_ [100 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ID          : num [1:100] 1 2 3 4 5 6 7 8 9 10 ...
 $ ProdQual    : num [1:100] 8.5 8.2 9.2 6.4 9 6.5 6.9 6.2 5.8 6.4 ...
 $ Ecom        : num [1:100] 3.9 2.7 3.4 3.3 3.4 2.8 3.7 3.3 3.6 4.5 ...
 $ TechSup     : num [1:100] 2.5 5.1 5.6 7 5.2 3.1 5 3.9 5.1 5.1 ...
 $ CompRes     : num [1:100] 5.9 7.2 5.6 3.7 4.6 4.1 2.6 4.8 6.7 6.1 ...
 $ Advertising : num [1:100] 4.8 3.4 5.4 4.7 2.2 4 2.1 4.6 3.7 4.7 ...
 $ ProdLine    : num [1:100] 4.9 7.9 7.4 4.7 6 4.3 2.3 3.6 5.9 5.7 ...
 $ SalesFImage : num [1:100] 6 3.1 5.8 4.5 4.5 3.7 5.4 5.1 5.8 5.7 ...
 $ ComPricing  : num [1:100] 6.8 5.3 4.5 8.8 6.8 8.5 8.9 6.9 9.3 8.4 ...
 $ WartyClaim  : num [1:100] 4.7 5.5 6.2 7 6.1 5.1 4.8 5.4 5.9 5.4 ...
 $ OrdBilling  : num [1:100] 5 3.9 5.4 4.3 4.5 3.6 2.1 4.3 4.4 4.1 ...
 $ DelSpeed    : num [1:100] 3.7 4.9 4.5 3 3.5 3.3 2 3.7 4.6 4.4 ...
 $ Satisfaction: num [1:100] 8.2 5.7 8.9 4.8 7.1 4.7 5.7 6.3 7 5.5 ...
 - attr(*, "spec")=
  .. cols(
  ..   ID = col_double(),
  ..   ProdQual = col_double(),
  ..   Ecom = col_double(),
  ..   TechSup = col_double(),
  ..   CompRes = col_double(),
  ..   Advertising = col_double(),
  ..   ProdLine = col_double(),
  ..   SalesFImage = col_double(),
  ..   ComPricing = col_double(),
  ..   WartyClaim = col_double(),
  ..   OrdBilling = col_double(),
  ..   DelSpeed = col_double(),
  ..   Satisfaction = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
names(prodsurvey)
 [1] "ID"           "ProdQual"     "Ecom"         "TechSup"     
 [5] "CompRes"      "Advertising"  "ProdLine"     "SalesFImage" 
 [9] "ComPricing"   "WartyClaim"   "OrdBilling"   "DelSpeed"    
[13] "Satisfaction"
describe(prodsurvey)
             vars   n  mean    sd median trimmed   mad min   max
ID              1 100 50.50 29.01  50.50   50.50 37.06 1.0 100.0
ProdQual        2 100  7.81  1.40   8.00    7.85  1.78 5.0  10.0
Ecom            3 100  3.67  0.70   3.60    3.63  0.52 2.2   5.7
TechSup         4 100  5.36  1.53   5.40    5.40  1.85 1.3   8.5
CompRes         5 100  5.44  1.21   5.45    5.46  1.26 2.6   7.8
Advertising     6 100  4.01  1.13   4.00    4.00  1.19 1.9   6.5
ProdLine        7 100  5.80  1.32   5.75    5.81  1.56 2.3   8.4
SalesFImage     8 100  5.12  1.07   4.90    5.09  0.89 2.9   8.2
ComPricing      9 100  6.97  1.55   7.10    7.02  1.93 3.7   9.9
WartyClaim     10 100  6.04  0.82   6.10    6.04  0.89 4.1   8.1
OrdBilling     11 100  4.28  0.93   4.40    4.31  0.74 2.0   6.7
DelSpeed       12 100  3.89  0.73   3.90    3.92  0.74 1.6   5.5
Satisfaction   13 100  6.92  1.19   7.05    6.90  1.33 4.7   9.9
             range  skew kurtosis   se
ID            99.0  0.00    -1.24 2.90
ProdQual       5.0 -0.24    -1.17 0.14
Ecom           3.5  0.64     0.57 0.07
TechSup        7.2 -0.20    -0.63 0.15
CompRes        5.2 -0.13    -0.66 0.12
Advertising    4.6  0.04    -0.94 0.11
ProdLine       6.1 -0.09    -0.60 0.13
SalesFImage    5.3  0.37     0.26 0.11
ComPricing     6.2 -0.23    -0.96 0.15
WartyClaim     4.0  0.01    -0.53 0.08
OrdBilling     4.7 -0.32     0.11 0.09
DelSpeed       3.9 -0.45     0.09 0.07
Satisfaction   5.2  0.08    -0.86 0.12
datamatrix1 <- cor(prodsurvey[,-1])
corrplot(datamatrix1, method="number")

As we can see from the above correlation matrix:

  1. Complaint resolution (CompRes) and delivery speed (DelSpeed) are highly correlated
  2. Order billing (OrdBilling) and CompRes are highly correlated
  3. WartyClaim and TechSupport are highly correlated
  4. OrdBilling and DelSpeed are highly correlated
  5. e-commerce (Ecom) and sales force image (SalesFImage) are highly correlated
pcor(prodsurvey[,-1], method="pearson")
$estimate
                 ProdQual          Ecom      TechSup      CompRes
ProdQual      1.000000000  0.1549597037  0.002424222 -0.056354142
Ecom          0.154959704  1.0000000000  0.082359000 -0.033513777
TechSup       0.002424222  0.0823590001  1.000000000  0.143603415
CompRes      -0.056354142 -0.0335137768  0.143603415  1.000000000
Advertising   0.112376746 -0.0002972504 -0.059300254 -0.064806093
ProdLine      0.281144724  0.1538660545 -0.125349050  0.020305650
SalesFImage  -0.376228551  0.7321011890 -0.093310796 -0.022150933
ComPricing   -0.014021386  0.0149131857 -0.132972485 -0.004487151
WartyClaim   -0.042752416 -0.1232553822  0.787729606 -0.109686211
OrdBilling    0.054523215  0.1546990521 -0.165914724  0.288344382
DelSpeed     -0.335084210 -0.0083917930 -0.021914916  0.528932328
Satisfaction  0.607438787 -0.3308440834  0.055111867  0.172416347
               Advertising    ProdLine  SalesFImage   ComPricing
ProdQual      0.1123767461  0.28114472 -0.376228551 -0.014021386
Ecom         -0.0002972504  0.15386605  0.732101189  0.014913186
TechSup      -0.0593002540 -0.12534905 -0.093310796 -0.132972485
CompRes      -0.0648060931  0.02030565 -0.022150933 -0.004487151
Advertising   1.0000000000 -0.13192319  0.262879343 -0.063298038
ProdLine     -0.1319231866  1.00000000 -0.230170570 -0.361757904
SalesFImage   0.2628793429 -0.23017057  1.000000000  0.126612489
ComPricing   -0.0632980381 -0.36175790  0.126612489  1.000000000
WartyClaim    0.0275909587  0.25718360  0.189300271  0.020351554
OrdBilling   -0.0326580430 -0.28098068 -0.182359301 -0.086500869
DelSpeed      0.2046544064  0.50189505  0.005889925  0.190437993
Satisfaction -0.0449735411  0.18325156  0.660251850 -0.087467711
              WartyClaim  OrdBilling     DelSpeed Satisfaction
ProdQual     -0.04275242  0.05452322 -0.335084210   0.60743879
Ecom         -0.12325538  0.15469905 -0.008391793  -0.33084408
TechSup       0.78772961 -0.16591472 -0.021914916   0.05511187
CompRes      -0.10968621  0.28834438  0.528932328   0.17241635
Advertising   0.02759096 -0.03265804  0.204654406  -0.04497354
ProdLine      0.25718360 -0.28098068  0.501895051   0.18325156
SalesFImage   0.18930027 -0.18235930  0.005889925   0.66025185
ComPricing    0.02035155 -0.08650087  0.190437993  -0.08746771
WartyClaim    1.00000000  0.25984451 -0.091649002  -0.08868071
OrdBilling    0.25984451  1.00000000  0.350530212   0.14880055
DelSpeed     -0.09164900  0.35053021  1.000000000   0.08955614
Satisfaction -0.08868071  0.14880055  0.089556143   1.00000000

$p.value
                 ProdQual         Ecom      TechSup      CompRes
ProdQual     0.000000e+00 1.447435e-01 9.819081e-01 5.977987e-01
Ecom         1.447435e-01 0.000000e+00 4.402823e-01 7.538375e-01
TechSup      9.819081e-01 4.402823e-01 0.000000e+00 1.769171e-01
CompRes      5.977987e-01 7.538375e-01 1.769171e-01 0.000000e+00
Advertising  2.916350e-01 9.977814e-01 5.787601e-01 5.439511e-01
ProdLine     7.269030e-03 1.476353e-01 2.391187e-01 8.493380e-01
SalesFImage  2.576212e-04 2.442012e-16 3.817042e-01 8.358301e-01
ComPricing   8.956443e-01 8.890480e-01 2.115149e-01 9.665194e-01
WartyClaim   6.890839e-01 2.471182e-01 3.262868e-20 3.034147e-01
OrdBilling   6.097697e-01 1.454288e-01 1.180864e-01 5.850710e-03
DelSpeed     1.245192e-03 9.374303e-01 8.375553e-01 8.359069e-08
Satisfaction 2.182238e-10 1.447603e-03 6.059096e-01 1.041593e-01
             Advertising     ProdLine  SalesFImage   ComPricing
ProdQual      0.29163501 7.269030e-03 2.576212e-04 0.8956443272
Ecom          0.99778145 1.476353e-01 2.442012e-16 0.8890479591
TechSup       0.57876015 2.391187e-01 3.817042e-01 0.2115148546
CompRes       0.54395113 8.493380e-01 8.358301e-01 0.9665194173
Advertising   0.00000000 2.151737e-01 1.230672e-02 0.5533828069
ProdLine      0.21517368 0.000000e+00 2.907563e-02 0.0004593443
SalesFImage   0.01230672 2.907563e-02 0.000000e+00 0.2343791374
ComPricing    0.55338281 4.593443e-04 2.343791e-01 0.0000000000
WartyClaim    0.79629803 1.440249e-02 7.394541e-02 0.8490014635
OrdBilling    0.75993047 7.304606e-03 8.538047e-02 0.4175555780
DelSpeed      0.05300070 4.664278e-07 9.560619e-01 0.0721942454
Satisfaction  0.67382405 8.383625e-02 1.449719e-12 0.4123500152
               WartyClaim   OrdBilling     DelSpeed Satisfaction
ProdQual     6.890839e-01 0.6097697424 1.245192e-03 2.182238e-10
Ecom         2.471182e-01 0.1454287628 9.374303e-01 1.447603e-03
TechSup      3.262868e-20 0.1180864174 8.375553e-01 6.059096e-01
CompRes      3.034147e-01 0.0058507099 8.359069e-08 1.041593e-01
Advertising  7.962980e-01 0.7599304715 5.300070e-02 6.738241e-01
ProdLine     1.440249e-02 0.0073046065 4.664278e-07 8.383625e-02
SalesFImage  7.394541e-02 0.0853804743 9.560619e-01 1.449719e-12
ComPricing   8.490015e-01 0.4175555780 7.219425e-02 4.123500e-01
WartyClaim   0.000000e+00 0.0133877361 3.902770e-01 4.058729e-01
OrdBilling   1.338774e-02 0.0000000000 7.064114e-04 1.615975e-01
DelSpeed     3.902770e-01 0.0007064114 0.000000e+00 4.012356e-01
Satisfaction 4.058729e-01 0.1615974575 4.012356e-01 0.000000e+00

$statistic
                ProdQual         Ecom     TechSup     CompRes
ProdQual      0.00000000  1.471424516  0.02274128 -0.52949015
Ecom          1.47142452  0.000000000  0.77522957 -0.31456380
TechSup       0.02274128  0.775229571  0.00000000  1.36122814
CompRes      -0.52949015 -0.314563798  1.36122814  0.00000000
Advertising   1.06090746 -0.002788456 -0.55726637 -0.60921569
ProdLine      2.74821968  1.460787002 -1.18522655  0.19052316
SalesFImage  -3.80921125 10.081854496 -0.87916864 -0.20784517
ComPricing   -0.13154519  0.139913642 -1.25856891 -0.04209363
WartyClaim   -0.40142023 -1.165122048 11.99562484 -1.03519395
OrdBilling    0.51223505  1.468888754 -1.57829305  2.82489237
DelSpeed     -3.33624278 -0.078724768 -0.20562952  5.84663073
Satisfaction  7.17336519 -3.288799958  0.51778207  1.64199901
              Advertising   ProdLine SalesFImage  ComPricing
ProdQual      1.060907458  2.7482197 -3.80921125 -0.13154519
Ecom         -0.002788456  1.4607870 10.08185450  0.13991364
TechSup      -0.557266374 -1.1852266 -0.87916864 -1.25856891
CompRes      -0.609215688  0.1905232 -0.20784517 -0.04209363
Advertising   0.000000000 -1.2484608  2.55592188 -0.59498137
ProdLine     -1.248460807  0.0000000 -2.21876450 -3.64012829
SalesFImage   2.555921881 -2.2187645  0.00000000  1.19736653
ComPricing   -0.594981366 -3.6401283  1.19736653  0.00000000
WartyClaim    0.258924708  2.4965744  1.80849286  0.19095404
OrdBilling   -0.306523103 -2.7464786 -1.73985585 -0.81450302
DelSpeed      1.961341689  5.4434474  0.05525335  1.81976992
Satisfaction -0.422316520  1.7486638  8.24679932 -0.82367672
             WartyClaim OrdBilling    DelSpeed Satisfaction
ProdQual     -0.4014202  0.5122350 -3.33624278    7.1733652
Ecom         -1.1651220  1.4688888 -0.07872477   -3.2888000
TechSup      11.9956248 -1.5782930 -0.20562952    0.5177821
CompRes      -1.0351939  2.8248924  5.84663073    1.6419990
Advertising   0.2589247 -0.3065231  1.96134169   -0.4223165
ProdLine      2.4965744 -2.7464786  5.44344736    1.7486638
SalesFImage   1.8084929 -1.7398559  0.05525335    8.2467993
ComPricing    0.1909540 -0.8145030  1.81976992   -0.8236767
WartyClaim    0.0000000  2.5242649 -0.86337748   -0.8351894
OrdBilling    2.5242649  0.0000000  3.51103503    1.4115877
DelSpeed     -0.8633775  3.5110350  0.00000000    0.8435005
Satisfaction -0.8351894  1.4115877  0.84350046    0.0000000

$n
[1] 100

$gp
[1] 10

$method
[1] "pearson"

Multiple Linear Regression

model1 <- lm(Satisfaction ~ ., prodsurvey[,-1])
summary(model1)

Call:
lm(formula = Satisfaction ~ ., data = prodsurvey[, -1])

Residuals:
     Min       1Q   Median       3Q      Max 
-1.43005 -0.31165  0.07621  0.37190  0.90120 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.66961    0.81233  -0.824  0.41199    
ProdQual     0.37137    0.05177   7.173 2.18e-10 ***
Ecom        -0.44056    0.13396  -3.289  0.00145 ** 
TechSup      0.03299    0.06372   0.518  0.60591    
CompRes      0.16703    0.10173   1.642  0.10416    
Advertising -0.02602    0.06161  -0.422  0.67382    
ProdLine     0.14034    0.08025   1.749  0.08384 .  
SalesFImage  0.80611    0.09775   8.247 1.45e-12 ***
ComPricing  -0.03853    0.04677  -0.824  0.41235    
WartyClaim  -0.10298    0.12330  -0.835  0.40587    
OrdBilling   0.14635    0.10367   1.412  0.16160    
DelSpeed     0.16570    0.19644   0.844  0.40124    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5623 on 88 degrees of freedom
Multiple R-squared:  0.8021,    Adjusted R-squared:  0.7774 
F-statistic: 32.43 on 11 and 88 DF,  p-value: < 2.2e-16

Variable Inflation Factor (VIF)

vif(model1)
   ProdQual        Ecom     TechSup     CompRes Advertising 
   1.635797    2.756694    2.976796    4.730448    1.508933 
   ProdLine SalesFImage  ComPricing  WartyClaim  OrdBilling 
   3.488185    3.439420    1.635000    3.198337    2.902999 
   DelSpeed 
   6.516014 

Factor Analysis

X <- prodsurvey[,-c(1,13)]
y <- prodsurvey[,13]

Let’s check the factorability of the variables in the dataset: perform the Kaiser-Meyer-Olkin (KMO) Test.

Kaiser-Meyer-Olkin (KMO) Test and/or Bartlett’s Test is a measure of how suited your data is for Factor Analysis. The test measures sampling adequacy for each variable in the model and for the complete model. The statistic is a measure of the proportion of variance among variables that might be common variance. The lower the proportion, the more suited your data is to Factor Analysis.

datamatrix2 <- cor(X)
KMO(r=datamatrix2)
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = datamatrix2)
Overall MSA =  0.65
MSA for each item = 
   ProdQual        Ecom     TechSup     CompRes Advertising 
       0.51        0.63        0.52        0.79        0.78 
   ProdLine SalesFImage  ComPricing  WartyClaim  OrdBilling 
       0.62        0.62        0.75        0.51        0.76 
   DelSpeed 
       0.67 
print(cortest.bartlett(datamatrix2, nrow(prodsurvey[,-1])))
$chisq
[1] 619.2726

$p.value
[1] 1.79337e-96

$df
[1] 55

Hence Factor Analysis is considered as an appropriate technique for further analysis of the data.

Find the optimal clusters

Plotting the eigenvalues from our factor analysis (whether it’s based on principal axis or principal components extraction), a parallel analysis involves generating random correlation matrices and after factor analyzing them, comparing the resulting eigenvalues to the eigenvalues of the observed data. The idea behind this method is thatobserved eigenvalues that are higher than their corresponding random eigenvalues are more likely to be from “meaningful factors” than observed eigenvalues that are below their corresponding random eigenvalue.

parallel <- fa.parallel(X)

Parallel analysis suggests that the number of factors =  3  and the number of components =  3 
# other option of scree plot
scree(X)

Selection of factors from the scree plot can be based on:

  1. Kaiser-Guttman normalization rule says that we should choose all factors with an eigenvalue greater than 1.
  2. Bend elbow rule

When looking at the parallel analysis scree plots, there are two places to look depending on which type of factor analysis you’re looking to run. The two blue lines show you the observed eigenvalues - they should look identical to the scree plots drawn by the scree function. The red dotted lines show the random eigenvalues or the simulated data line. Each point on the blue line that lies above the corresponding simulated data line is a factor or component to extract. In this analysis, you can see that 4 factors in the “Factor Analysis” parallel analysis lie above the corresponding simulated data line and 3 components in the “Principal Components” parallel analysis lie above the corresponding simulated data line.
In our case, however, the last factor/component lies very close to the line - for both principal components extraction and principal axis extraction. Thus, we should probably compare the 3 factor and 4 factor solutions, to see which one is most interpretable.

ev <- eigen(cor(X))
ev$values
 [1] 3.42697133 2.55089671 1.69097648 1.08655606 0.60942409 0.55188378
 [7] 0.40151815 0.24695154 0.20355327 0.13284158 0.09842702
Factor <- rep(1:11)
Eigen_Values <- ev$values
Scree <- data.frame(Factor, Eigen_Values)

ggplot(data = Scree, mapping = aes(x = Factor, y = Eigen_Values)) + 
  geom_point() +
  geom_line() +
  scale_y_continuous(name = "Eigen Values", limits = c(0,4)) +
  theme_bw() + 
  theme(panel.grid.major.y = element_line(color = "skyblue")) +
  ggtitle("Scree Plot")

So as per the elbow or Kaiser-Guttman normalization rule, we are good to go ahead with 4 factors.

Conducting the Factor Analysis

  1. Factor analysis using fa method:
fa.none <-  fa(r=X, 
              nfactors = 4, 
#              covar = FALSE, SMC = TRUE,
              fm="pa", # type of factor analysis we want to use (“pa” is principal axis factoring)
              max.iter=100, # (50 is the default, but we have changed it to 100
              rotate="none") # none rotation
print(fa.none)
Factor Analysis using method =  pa
Call: fa(r = X, nfactors = 4, rotate = "none", max.iter = 100, fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
              PA1   PA2   PA3   PA4   h2    u2 com
ProdQual     0.20 -0.41 -0.06  0.46 0.42 0.576 2.4
Ecom         0.29  0.66  0.27  0.22 0.64 0.362 2.0
TechSup      0.28 -0.38  0.74 -0.17 0.79 0.205 1.9
CompRes      0.86  0.01 -0.26 -0.18 0.84 0.157 1.3
Advertising  0.29  0.46  0.08  0.13 0.31 0.686 1.9
ProdLine     0.69 -0.45 -0.14  0.31 0.80 0.200 2.3
SalesFImage  0.39  0.80  0.35  0.25 0.98 0.021 2.1
ComPricing  -0.23  0.55 -0.04 -0.29 0.44 0.557 1.9
WartyClaim   0.38 -0.32  0.74 -0.15 0.81 0.186 2.0
OrdBilling   0.75  0.02 -0.18 -0.18 0.62 0.378 1.2
DelSpeed     0.90  0.10 -0.30 -0.20 0.94 0.058 1.4

                       PA1  PA2  PA3  PA4
SS loadings           3.21 2.22 1.50 0.68
Proportion Var        0.29 0.20 0.14 0.06
Cumulative Var        0.29 0.49 0.63 0.69
Proportion Explained  0.42 0.29 0.20 0.09
Cumulative Proportion 0.42 0.71 0.91 1.00

Mean item complexity =  1.9
Test of the hypothesis that 4 factors are sufficient.

The degrees of freedom for the null model are  55  and the objective function was  6.55 with Chi Square of  619.27
The degrees of freedom for the model are 17  and the objective function was  0.33 

The root mean square of the residuals (RMSR) is  0.02 
The df corrected root mean square of the residuals is  0.03 

The harmonic number of observations is  100 with the empirical chi square  3.19  with prob <  1 
The total number of observations was  100  with Likelihood Chi Square =  30.27  with prob <  0.024 

Tucker Lewis Index of factoring reliability =  0.921
RMSEA index =  0.088  and the 90 % confidence intervals are  0.032 0.139
BIC =  -48.01
Fit based upon off diagonal values = 1
Measures of factor score adequacy             
                                                   PA1  PA2  PA3  PA4
Correlation of (regression) scores with factors   0.98 0.97 0.95 0.88
Multiple R square of scores with factors          0.96 0.95 0.91 0.78
Minimum correlation of possible factor scores     0.92 0.90 0.82 0.56
  1. Factor analysis using the factanal method:
factanal.none <- factanal(X, factors=4, scores = c("regression"), rotation = "none")
print(factanal.none)

Call:
factanal(x = X, factors = 4, scores = c("regression"), rotation = "none")

Uniquenesses:
   ProdQual        Ecom     TechSup     CompRes Advertising 
      0.682       0.360       0.228       0.178       0.679 
   ProdLine SalesFImage  ComPricing  WartyClaim  OrdBilling 
      0.005       0.017       0.636       0.163       0.347 
   DelSpeed 
      0.076 

Loadings:
            Factor1 Factor2 Factor3 Factor4
ProdQual     0.467  -0.148  -0.229  -0.159 
Ecom                 0.791                 
TechSup      0.198          -0.482   0.707 
CompRes      0.589   0.316   0.535   0.300 
Advertising          0.556   0.110         
ProdLine     0.997                         
SalesFImage          0.987                 
ComPricing  -0.491   0.252   0.238         
WartyClaim   0.279   0.124  -0.487   0.712 
OrdBilling   0.452   0.275   0.493   0.360 
DelSpeed     0.629   0.364   0.582   0.241 

               Factor1 Factor2 Factor3 Factor4
SS loadings      2.522   2.316   1.469   1.322
Proportion Var   0.229   0.211   0.134   0.120
Cumulative Var   0.229   0.440   0.573   0.694

Test of the hypothesis that 4 factors are sufficient.
The chi square statistic is 24.26 on 17 degrees of freedom.
The p-value is 0.113 
factanal.var <- factanal(X, factors=4, scores = c("regression"), rotation = "varimax")
print(factanal.var)

Call:
factanal(x = X, factors = 4, scores = c("regression"), rotation = "varimax")

Uniquenesses:
   ProdQual        Ecom     TechSup     CompRes Advertising 
      0.682       0.360       0.228       0.178       0.679 
   ProdLine SalesFImage  ComPricing  WartyClaim  OrdBilling 
      0.005       0.017       0.636       0.163       0.347 
   DelSpeed 
      0.076 

Loadings:
            Factor1 Factor2 Factor3 Factor4
ProdQual                             0.557 
Ecom                 0.793                 
TechSup                      0.872   0.102 
CompRes      0.884   0.142           0.135 
Advertising  0.190   0.521          -0.110 
ProdLine     0.502           0.104   0.856 
SalesFImage  0.119   0.974          -0.130 
ComPricing           0.225  -0.216  -0.514 
WartyClaim                   0.894   0.158 
OrdBilling   0.794   0.101   0.105         
DelSpeed     0.928   0.189           0.164 

               Factor1 Factor2 Factor3 Factor4
SS loadings      2.592   1.977   1.638   1.423
Proportion Var   0.236   0.180   0.149   0.129
Cumulative Var   0.236   0.415   0.564   0.694

Test of the hypothesis that 4 factors are sufficient.
The chi square statistic is 24.26 on 17 degrees of freedom.
The p-value is 0.113 

Graph Factor Loading Matrices

Factor analysis results are typically interpreted in terms of the major loadings on each factor. These structures may be represented an “interpretable” solution as a table of loadings or graphically, where all loadings with an absolute value \(\rightarrow\) some cut point are represented as an edge (path).

fa.diagram(fa.none)

fa.var <-  fa(r=X, 
              nfactors = 4, 
#              covar = FALSE, SMC = TRUE,
              fm="pa", # type of factor analysis we want to use (“pa” is principal axis factoring)
              max.iter=100, # (50 is the default, but we have changed it to 100
              rotate="varimax") # none rotation
print(fa.var)
Factor Analysis using method =  pa
Call: fa(r = X, nfactors = 4, rotate = "varimax", max.iter = 100, fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
              PA1   PA2   PA3   PA4   h2    u2 com
ProdQual     0.02 -0.07  0.02  0.65 0.42 0.576 1.0
Ecom         0.07  0.79  0.03 -0.11 0.64 0.362 1.1
TechSup      0.02 -0.03  0.88  0.12 0.79 0.205 1.0
CompRes      0.90  0.13  0.05  0.13 0.84 0.157 1.1
Advertising  0.17  0.53 -0.04 -0.06 0.31 0.686 1.2
ProdLine     0.53 -0.04  0.13  0.71 0.80 0.200 1.9
SalesFImage  0.12  0.97  0.06 -0.13 0.98 0.021 1.1
ComPricing  -0.08  0.21 -0.21 -0.59 0.44 0.557 1.6
WartyClaim   0.10  0.06  0.89  0.13 0.81 0.186 1.1
OrdBilling   0.77  0.13  0.09  0.09 0.62 0.378 1.1
DelSpeed     0.95  0.19  0.00  0.09 0.94 0.058 1.1

                       PA1  PA2  PA3  PA4
SS loadings           2.63 1.97 1.64 1.37
Proportion Var        0.24 0.18 0.15 0.12
Cumulative Var        0.24 0.42 0.57 0.69
Proportion Explained  0.35 0.26 0.22 0.18
Cumulative Proportion 0.35 0.60 0.82 1.00

Mean item complexity =  1.2
Test of the hypothesis that 4 factors are sufficient.

The degrees of freedom for the null model are  55  and the objective function was  6.55 with Chi Square of  619.27
The degrees of freedom for the model are 17  and the objective function was  0.33 

The root mean square of the residuals (RMSR) is  0.02 
The df corrected root mean square of the residuals is  0.03 

The harmonic number of observations is  100 with the empirical chi square  3.19  with prob <  1 
The total number of observations was  100  with Likelihood Chi Square =  30.27  with prob <  0.024 

Tucker Lewis Index of factoring reliability =  0.921
RMSEA index =  0.088  and the 90 % confidence intervals are  0.032 0.139
BIC =  -48.01
Fit based upon off diagonal values = 1
Measures of factor score adequacy             
                                                   PA1  PA2  PA3  PA4
Correlation of (regression) scores with factors   0.98 0.99 0.94 0.88
Multiple R square of scores with factors          0.96 0.97 0.88 0.78
Minimum correlation of possible factor scores     0.93 0.94 0.77 0.55
fa.var$loadings

Loadings:
            PA1    PA2    PA3    PA4   
ProdQual                          0.647
Ecom                0.787        -0.113
TechSup                    0.883  0.116
CompRes      0.898  0.130         0.132
Advertising  0.166  0.530              
ProdLine     0.525         0.127  0.712
SalesFImage  0.115  0.971        -0.135
ComPricing          0.213 -0.209 -0.590
WartyClaim   0.103         0.885  0.128
OrdBilling   0.768  0.127              
DelSpeed     0.949  0.185              

                 PA1   PA2   PA3   PA4
SS loadings    2.635 1.967 1.641 1.371
Proportion Var 0.240 0.179 0.149 0.125
Cumulative Var 0.240 0.418 0.568 0.692
fa.diagram(fa.var)

Labeling and interpretation of the factors

Factors Variables Label Short Interpretaation
PA1 DelSpeed, CompRes, OrdBilling Purchase All the items are related to purchasing the product; from placing order to billing and getting it delivered.
PA2 SalesFImage, Ecom, Advertising Marketing In this factors the item are related to marketing processes like the image of sales force and spending on adverstising
PA3 WartyClaim, TechSup Post Purchase Post purchase activities are included in this factor; like warranty claims and technical support.
PA4 ProdLine, ProdQual, CompPricing Product Position Product positioning related items are grouped in this factor.

Scores for all the rows

head(fa.var$scores)
            PA1        PA2          PA3         PA4
[1,] -0.1338871  0.9175166 -1.719604873  0.09135411
[2,]  1.6297604 -2.0090053 -0.596361722  0.65808192
[3,]  0.3637658  0.8361736  0.002979966  1.37548765
[4,] -1.2225230 -0.5491336  1.245473305 -0.64421384
[5,] -0.4854209 -0.4276223 -0.026980304  0.47360747
[6,] -0.5950924 -1.3035333 -1.183019401 -0.95913571
#fa.var$scores # for 100 observations

Regression analysis using the factors scores as the independent variable

regdata <- cbind(prodsurvey["Satisfaction"], fa.var$scores)

#Labeling the data
names(regdata) <- c("Satisfaction", "Purchase", "Marketing",
                    "Post_purchase", "Prod_positioning")
head(regdata)
  Satisfaction   Purchase  Marketing Post_purchase Prod_positioning
1          8.2 -0.1338871  0.9175166  -1.719604873       0.09135411
2          5.7  1.6297604 -2.0090053  -0.596361722       0.65808192
3          8.9  0.3637658  0.8361736   0.002979966       1.37548765
4          4.8 -1.2225230 -0.5491336   1.245473305      -0.64421384
5          7.1 -0.4854209 -0.4276223  -0.026980304       0.47360747
6          4.7 -0.5950924 -1.3035333  -1.183019401      -0.95913571

Splitting the data to train and test set

#Splitting the data 70:30
#Random number generator, set seed.
set.seed(100)
indices= sample(1:nrow(regdata), 0.7*nrow(regdata))
train=regdata[indices,]
test = regdata[-indices,]

Regression Model using train data

model.fa.score = lm(Satisfaction~., train)
summary(model.fa.score)

Call:
lm(formula = Satisfaction ~ ., data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.76280 -0.48717  0.06799  0.46459  1.24022 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       6.91852    0.08068  85.757  < 2e-16 ***
Purchase          0.50230    0.07641   6.574 9.74e-09 ***
Marketing         0.75488    0.08390   8.998 5.00e-13 ***
Post_purchase     0.08755    0.08216   1.066    0.291    
Prod_positioning  0.58074    0.08781   6.614 8.30e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6629 on 65 degrees of freedom
Multiple R-squared:  0.7261,    Adjusted R-squared:  0.7093 
F-statistic: 43.08 on 4 and 65 DF,  p-value: < 2.2e-16

Check vif

vif(model.fa.score)
        Purchase        Marketing    Post_purchase Prod_positioning 
        1.009648         1.008235         1.015126         1.024050 

Check prediction of the model in the test dataset

#Model Performance metrics:
pred_test <- predict(model.fa.score, newdata = test, type = "response")
pred_test
       6        8       11       13       17       19       26 
4.975008 5.908267 6.951629 8.677431 6.613838 6.963113 6.313513 
      33       34       35       37       40       42       44 
6.141338 6.158993 7.415742 6.589746 6.858206 7.133989 8.533080 
      49       50       53       56       57       60       65 
8.765145 8.078744 7.395438 7.468360 8.744402 6.276660 5.936570 
      67       71       73       75       80       96       97 
6.650322 8.299545 7.685564 7.330191 6.719528 7.540233 6.143172 
      99      100 
8.084583 5.799897 
#Find MSE and MAPE scores:
#MSE/ MAPE of Model1
test$Satisfaction_Predicted <- pred_test
head(test[c("Satisfaction","Satisfaction_Predicted")], 10)
   Satisfaction Satisfaction_Predicted
6           4.7               4.975008
8           6.3               5.908267
11          7.4               6.951629
13          8.4               8.677431
17          6.4               6.613838
19          6.8               6.963113
26          6.6               6.313513
33          5.4               6.141338
34          7.3               6.158993
35          6.3               7.415742
test_r2 <- cor(test$Satisfaction, test$Satisfaction_Predicted) ^2

mse_test <- mse(test$Satisfaction, pred_test)
rmse_test <- sqrt(mse(test$Satisfaction, pred_test))
mape_test <- mape(test$Satisfaction, pred_test)
model_metrics <- cbind(mse_test,rmse_test,mape_test,test_r2)
print(model_metrics, 3)
     mse_test rmse_test mape_test test_r2
[1,]    0.551     0.742    0.0875   0.554

As the feature Post_purchase is not significant so we will drop this feature and then let’s run the regression model again.

##Regression model without post_purchase:
model2 <- lm(Satisfaction ~ Purchase+ Marketing+ 
                Prod_positioning, data= train)
summary(model2)

Call:
lm(formula = Satisfaction ~ Purchase + Marketing + Prod_positioning, 
    data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7268 -0.5036  0.1117  0.4408  1.2962 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       6.92158    0.08071  85.761  < 2e-16 ***
Purchase          0.49779    0.07637   6.518 1.15e-08 ***
Marketing         0.75870    0.08391   9.042 3.66e-13 ***
Prod_positioning  0.59084    0.08739   6.761 4.30e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6636 on 66 degrees of freedom
Multiple R-squared:  0.7213,    Adjusted R-squared:  0.7087 
F-statistic: 56.95 on 3 and 66 DF,  p-value: < 2.2e-16
pred_test2 <- predict(model2, newdata = test, type = "response")
pred_test2
       6        8       11       13       17       19       26 
5.069667 5.958765 7.001694 8.695630 6.521440 6.994018 6.249667 
      33       34       35       37       40       42       44 
6.123357 6.132165 7.415882 6.566556 6.964570 7.233376 8.424720 
      49       50       53       56       57       60       65 
8.853060 8.177009 7.413693 7.404078 8.746511 6.215816 5.893515 
      67       71       73       75       80       96       97 
6.476465 8.227837 7.839941 7.354907 6.734279 7.559995 6.293774 
      99      100 
8.083909 5.832094 
test$Satisfaction_Predicted2 <- pred_test2
head(test[c(1,7)], 10)
   Satisfaction Satisfaction_Predicted2
6           4.7                5.069667
8           6.3                5.958765
11          7.4                7.001694
13          8.4                8.695630
17          6.4                6.521440
19          6.8                6.994018
26          6.6                6.249667
33          5.4                6.123357
34          7.3                6.132165
35          6.3                7.415882
test_r22 <- cor(test$Satisfaction, test$Satisfaction_Predicted2) ^2
mse_test2 <- mse(test$Satisfaction, pred_test2)
rmse_test2 <- sqrt(mse(test$Satisfaction, pred_test2))
mape_test2 <- mape(test$Satisfaction, pred_test2)

model2_metrics <- cbind(mse_test2,rmse_test2,mape_test2,test_r22)
model2_metrics
     mse_test2 rmse_test2 mape_test2  test_r22
[1,] 0.5446039  0.7379728 0.08490197 0.5595099
Overall <- rbind(model_metrics,model2_metrics)
row.names(Overall) <- c("Test1", "Test2")
colnames(Overall) <- c("MSE", "RMSE", "MAPE", "R-squared")
print(Overall,3)
        MSE  RMSE   MAPE R-squared
Test1 0.551 0.742 0.0875     0.554
Test2 0.545 0.738 0.0849     0.560

Try the interaction?!

Including Interaction model, we are able to make a better prediction!??

Conclusion

We saw how Factor Analysis can be used to reduce the dimensionality of a dataset and then we used multiple linear regression on the dimensionally reduced columns/Features for further analysis/predictions.

Further Reading

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hai-mn/hai-mn.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Nguyen (2022, Sept. 30). HaiBiostat: Factor Analysis applied in Multivariate Regression: Machine Learning's Perspective. Retrieved from https://hai-mn.github.io/posts/2022-09-30-FA in Multivariate Regression/

BibTeX citation

@misc{nguyen2022factor,
  author = {Nguyen, Hai},
  title = {HaiBiostat: Factor Analysis applied in Multivariate Regression: Machine Learning's Perspective},
  url = {https://hai-mn.github.io/posts/2022-09-30-FA in Multivariate Regression/},
  year = {2022}
}